Understanding How Audit Factors Affecting Case Sales


Background and Introduction

The data on the following sheet is some sample data that has been anonymized. It is a combination of case sales, machines present, & other audit data that we’ve collected on a store level.

1.) Our key metric that we look at is asset utilization. We want to know how well each of our machines are performing from a case sales perspective. Poorly performing machines are not profitable.

2.) It is normal to see a fractional machine number. We give our accounts the benefit of taking into account when a machine is placed. A whole number indicates that the machine has been there for a full 12 months (i.e. a machine value of 0.2 would indicate that the machine has only been at that location for a few months).

3.) Our business model is to sell cases of product to a distributor who in turn sells it to a store. We maintain a relationship with each store because we place a Frazil-owned machine at each location. You will see multiple stores roll up to a single distributor.


Methods and Results

The following table displays the variable names in this data set, along with their descriptions.

Variable Description
Cases Number of sale cases in a store (response variable)
Location ID Unique Identifier of a store
Location City as the name show
Location States as the name show
Location Zip Zip Code
Machines number of machine at the store for more than 1 year
Price_12oz Price for 12 oz product
Price_20oz Price for 20 oz product
Price_32oz Price for 32 oz product
Broken If the machine is not working how long has it been broken?
Rate rate of the cleanness and experience in the store

I start by applying basic summary and exploratory statistics to this data to better understand the data and identify trends, and I choose the variables that are most related to the purpose of the analysis, Cases, Machines, Price_12oz, Price_20oz, Price_32oz, Broken, and Rate. And I will use multiple linear regression for the analysis.

# read excel file
mock <- read.csv("Mock Data Analysis-1.csv", header = TRUE, sep = ",") %>% 
  select("Cases", "Machines", 
         "Q3...1..What.are.the.current.prices.for.a.12oz..20oz..and.32oz.cup.of.Frazil....Price...12.oz.",
         "Q3...2..What.are.the.current.prices.for.a.12oz..20oz..and.32oz.cup.of.Frazil....Price...20.oz.",
         "Q3...3..What.are.the.current.prices.for.a.12oz..20oz..and.32oz.cup.of.Frazil....Price...32.oz.",
         "Q4..If.the.machine.is.not.working..how.long.has.it.been.broken.",
         "Q5..How.would.you.rate.the.cleanliness.and.experience.in.the.store.") %>% 
  rename(Price_12oz = "Q3...1..What.are.the.current.prices.for.a.12oz..20oz..and.32oz.cup.of.Frazil....Price...12.oz.",
         Price_20oz = "Q3...2..What.are.the.current.prices.for.a.12oz..20oz..and.32oz.cup.of.Frazil....Price...20.oz.",
         Price_32oz = "Q3...3..What.are.the.current.prices.for.a.12oz..20oz..and.32oz.cup.of.Frazil....Price...32.oz.",
         Broken = "Q4..If.the.machine.is.not.working..how.long.has.it.been.broken.",
         Rate = "Q5..How.would.you.rate.the.cleanliness.and.experience.in.the.store.")

mock$Cases <- as.double(mock$Cases)
mock$Price_12oz <- as.double(mock$Price_12oz)
mock$Price_20oz <- as.double(mock$Price_20oz)
mock$Price_32oz <- as.double(mock$Price_32oz)
mock$Machines <- as.double(mock$Machines)
mock$Machines <- round(mock$Machines)

mock <- filter(mock, !is.na(Cases), Cases > 0,
               !(is.na(Price_12oz) & is.na(Price_20oz) & is.na(Price_32oz)),
               Machines <= 3)

mock$Broken <- as.factor(mock$Broken)
mock$Rate <- as.factor(mock$Rate) 
mock$Machines <- as.factor(mock$Machines)

### show the data set
summary(mock)
##      Cases        Machines   Price_12oz      Price_20oz      Price_32oz   
##  Min.   :  1.00   0:  53   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 10.00   1:1103   1st Qu.:1.290   1st Qu.:1.990   1st Qu.:2.690  
##  Median : 18.00   2: 320   Median :1.490   Median :1.990   Median :2.990  
##  Mean   : 23.64   3:  40   Mean   :1.554   Mean   :2.138   Mean   :2.987  
##  3rd Qu.: 31.00            3rd Qu.:1.690   3rd Qu.:2.390   3rd Qu.:3.490  
##  Max.   :153.00            Max.   :3.590   Max.   :3.990   Max.   :9.590  
##                            NA's   :130     NA's   :65      NA's   :244    
##                                   Broken    
##  Less than 24 hours                  :  15  
##  More than 1 month                   :  32  
##  More than 1 week, less than 1 month :  25  
##  More than 24 hours, less than 1 week:  17  
##  Not Broken                          :1392  
##  Store clerk unsure                  :  35  
##                                             
##                                                                               Rate    
##  1 - Store is filthy, old appliances, etc. It is noticeably bad.                : 49  
##  2 - Store is average in these aspects. There is nothing out of the ordinary.   :543  
##  3 - Store is clean, modern, etc. You are impressed with over all look and feel.:924  
##                                                                                       
##                                                                                       
##                                                                                       
## 
head(mock)
##   Cases Machines Price_12oz Price_20oz Price_32oz     Broken
## 1    16        1       1.34       1.93         NA Not Broken
## 2    83        2       1.49       1.89         NA Not Broken
## 3    28        1       1.49       1.99         NA Not Broken
## 4     6        1       2.29       3.49         NA Not Broken
## 5    26        1       1.99       2.59         NA Not Broken
## 6   149        2       1.39       1.99         NA Not Broken
##                                                                              Rate
## 1    2 - Store is average in these aspects. There is nothing out of the ordinary.
## 2 3 - Store is clean, modern, etc. You are impressed with over all look and feel.
## 3 3 - Store is clean, modern, etc. You are impressed with over all look and feel.
## 4    2 - Store is average in these aspects. There is nothing out of the ordinary.
## 5 3 - Store is clean, modern, etc. You are impressed with over all look and feel.
## 6 3 - Store is clean, modern, etc. You are impressed with over all look and feel.
# create data set only with continuous variable
mock_cont <- mock %>% select(Cases, Price_12oz : Price_32oz)

### scatterplot matrix (only with continuous)
pairs(mock_cont, pch=5)

### correlation matrix (only used for continuous variables)
cor(mock_cont, use = "na.or.complete")
##                  Cases Price_12oz  Price_20oz  Price_32oz
## Cases       1.00000000 0.02748816 -0.01529701 -0.05556294
## Price_12oz  0.02748816 1.00000000  0.74622613  0.49811434
## Price_20oz -0.01529701 0.74622613  1.00000000  0.75745704
## Price_32oz -0.05556294 0.49811434  0.75745704  1.00000000
round(cor(mock_cont, use = "na.or.complete"),2)
##            Cases Price_12oz Price_20oz Price_32oz
## Cases       1.00       0.03      -0.02      -0.06
## Price_12oz  0.03       1.00       0.75       0.50
## Price_20oz -0.02       0.75       1.00       0.76
## Price_32oz -0.06       0.50       0.76       1.00
corrplot(cor(mock_cont, , use = "na.or.complete"), type = "upper")

The correlation matrix does not show that price of the drink has a strong correlation with cases. Since the aim of the analysis is to see how well each of our machines are performing from a case sales perspective. I decide to remove price factors because they does they did not have linear relationship with cases.

# read excel file
mock <- read.csv("Mock Data Analysis-1.csv", header = TRUE, sep = ",") %>% 
  select("Cases", "Machines", 
         "Q4..If.the.machine.is.not.working..how.long.has.it.been.broken.",
         "Q5..How.would.you.rate.the.cleanliness.and.experience.in.the.store.") %>% 
  rename(Broken = "Q4..If.the.machine.is.not.working..how.long.has.it.been.broken.",
         Rate = "Q5..How.would.you.rate.the.cleanliness.and.experience.in.the.store.")

# turn continuous variable to factors
mock$Cases <- as.double(mock$Cases)
## Warning: NAs introduced by coercion
# turn machine to whole number that represent number of machine for more than
# a year at the store
mock$Machines <- as.double(mock$Machines)
## Warning: NAs introduced by coercion
mock$Machines <- round(mock$Machines)

# there are less than 10 rows have machine > 3, they are very small compare to
# the entire data
mock <- filter(mock, !is.na(Cases), Cases > 0, Machines <= 3)

# turn categorical variables to factors
mock$Broken <- as.factor(mock$Broken)
mock$Rate <- as.factor(mock$Rate) 
mock$Machines <- as.factor(mock$Machines)

# change name of levels in Broken and Rate
levels(mock$Broken) <- list(BK = "Less than 24 hours", BK = "More than 1 month",
                            BK = "More than 1 week, less than 1 month",
                            BK = "More than 24 hours, less than 1 week",
                            NB = "Not Broken", NB = "Store clerk unsure", NB = "<NA>")

levels(mock$Rate) <- list(L1 = "1 - Store is filthy, old appliances, etc. It is noticeably bad.", 
                          L2 = "2 - Store is average in these aspects. There is nothing out of the ordinary." ,
                          L3 = "3 - Store is clean, modern, etc. You are impressed with over all look and feel.")

# show the data set
summary(mock)
##      Cases        Machines  Broken       Rate     
##  Min.   :  1.00   0:  55   BK  : 114   L1  :  76  
##  1st Qu.:  9.00   1:1304   NB  :1631   L2  : 656  
##  Median : 17.00   2: 350   NA's:   7   L3  :1014  
##  Mean   : 22.66   3:  43               NA's:   6  
##  3rd Qu.: 30.00                                   
##  Max.   :153.00
head(mock)
##   Cases Machines Broken Rate
## 1    16        1     NB   L2
## 2     4        1     BK   L2
## 3    83        2     NB   L3
## 4    28        1     NB   L3
## 5     6        1     NB   L2
## 6    33        2     NB   L3
### Box Plot (for categorical)
# remove missing value
mock <- mock %>% 
  filter(!is.na(Rate), !is.na(Broken))

# create a box plot function 
BoxPlotMock <- function(variable, name) {
  ggplot(data = mock, mapping = aes(x = variable, y = Cases)) +
    geom_boxplot() +
    theme_bw() +
    xlab(name) +
    theme(aspect.ratio = 1)
}

# output box plot
BoxPlotMock(mock$Machines, "Machines")

BoxPlotMock(mock$Broken, "Broken")

BoxPlotMock(mock$Rate, "Rate")

### Interaction plot between Machine, Broken and Rate
ggplot(data = mock,
       mapping = aes(x = Machines, y = Cases,
                     color = Broken, shape = Rate)) +
  geom_point(size = 2) +
  theme_bw() +
  theme(aspect.ratio = 1)

From our exploratory data analyses, I notice several interesting features. The box plot shows Machines clearly has strong positive correlation with Cases, and Rate might have some positive correlation, but Broken does not show a strong correlation with Cases. Additionally, I will consider including interaction terms, particularly an interaction between Rate and Machines.

I now want to fit a multiple linear regression model to the data set with Cases as the response and the remaining variables as predictors. Here is the general linear model I want to fit:

\(\text{Cases_i} = \beta_0 + \beta_1\times\text{Machines}_i +\) \(\beta_2\times\text{Broken}_i + \beta_3\times\text{Rate}_i + \epsilon_i\) where \(\epsilon_i \sim N(0, \sigma^2)\)


Variable Selection

I will start by doing variable selection to find the simplest model possible. I also think multicollinearity will be an issue within the variables, so variable selection will help reduce this problem.

# create dummy variable
mock$machine1 <- ifelse(mock$Machines == "1", 1, 0)
mock$machine2 <- ifelse(mock$Machines == "2", 1, 0)
mock$machine3 <- ifelse(mock$Machines == "3", 1, 0)
mock$RateL2 <- ifelse(mock$Rate == "L2", 1, 0)
mock$RateL3 <- ifelse(mock$Rate == "L3", 1, 0)
mock$BrokenNB <- ifelse(mock$Broken == "NB", 1, 0)
mock$BrokenBK <- ifelse(mock$Broken == "BK", 1, 0)
# change response variable to the last column
# DON'T include baseline variables
mock_select <- mock %>% select(machine1 : machine3, RateL2, RateL3, BrokenBK, Cases)
# best subsets
head(mock_select)
##   machine1 machine2 machine3 RateL2 RateL3 BrokenBK Cases
## 1        1        0        0      1      0        0    16
## 2        1        0        0      1      0        1     4
## 3        0        1        0      0      1        0    83
## 4        1        0        0      0      1        0    28
## 5        1        0        0      1      0        0     6
## 6        0        1        0      0      1        0    33
best_subset_aic <- bestglm(mock_select, IC = "BIC", method = "exhaustive")
best_subset_aic$BestModels 
##   machine1 machine2 machine3 RateL2 RateL3 BrokenBK Criterion
## 1     TRUE     TRUE     TRUE  FALSE   TRUE     TRUE  9862.650
## 2     TRUE     TRUE     TRUE   TRUE   TRUE     TRUE  9863.153
## 3     TRUE     TRUE     TRUE  FALSE  FALSE     TRUE  9869.848
## 4     TRUE     TRUE     TRUE   TRUE   TRUE    FALSE  9870.426
## 5     TRUE     TRUE     TRUE   TRUE  FALSE     TRUE  9871.368
best_model_subset <- best_subset_aic$BestModel
summary(best_model_subset)
## 
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
##     drop = FALSE], y = y))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.193  -9.934  -2.927   7.066 109.946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.988      2.321   1.718 0.085950 .  
## machine1      11.946      2.308   5.177 2.52e-07 ***
## machine2      35.932      2.431  14.783  < 2e-16 ***
## machine3      51.071      3.411  14.972  < 2e-16 ***
## RateL3         3.133      0.818   3.831 0.000132 ***
## BrokenBK      -7.007      1.627  -4.307 1.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.75 on 1738 degrees of freedom
## Multiple R-squared:  0.3282, Adjusted R-squared:  0.3263 
## F-statistic: 169.8 on 5 and 1738 DF,  p-value: < 2.2e-16
# # forward selection
# best_subset_aic_forward <- bestglm(mock_select, IC = "BIC", method = "forward")
# best_subset_aic_forward$BestModels 
# best_model_subset_forward <- best_subset_aic_forward$BestModel
# summary(best_model_subset_forward)
#  backward selection
best_subset_cv <- bestglm(mock_select, IC = "BIC", method = "backward", t = 100)
best_model_subset_cv <- best_subset_cv$BestModel
summary(best_model_subset_cv)
## 
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
##     drop = FALSE], y = y))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.193  -9.934  -2.927   7.066 109.946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.988      2.321   1.718 0.085950 .  
## machine1      11.946      2.308   5.177 2.52e-07 ***
## machine2      35.932      2.431  14.783  < 2e-16 ***
## machine3      51.071      3.411  14.972  < 2e-16 ***
## RateL3         3.133      0.818   3.831 0.000132 ***
## BrokenBK      -7.007      1.627  -4.307 1.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.75 on 1738 degrees of freedom
## Multiple R-squared:  0.3282, Adjusted R-squared:  0.3263 
## F-statistic: 169.8 on 5 and 1738 DF,  p-value: < 2.2e-16
# step wise/sequential
best_subset_ss <- bestglm(mock_select, IC = "BIC", method = "seqrep", t = 100)
best_model_subset_ss <- best_subset_ss$BestModel
summary(best_model_subset_ss)
## 
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
##     drop = FALSE], y = y))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.193  -9.934  -2.927   7.066 109.946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.988      2.321   1.718 0.085950 .  
## machine1      11.946      2.308   5.177 2.52e-07 ***
## machine2      35.932      2.431  14.783  < 2e-16 ***
## machine3      51.071      3.411  14.972  < 2e-16 ***
## RateL3         3.133      0.818   3.831 0.000132 ***
## BrokenBK      -7.007      1.627  -4.307 1.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.75 on 1738 degrees of freedom
## Multiple R-squared:  0.3282, Adjusted R-squared:  0.3263 
## F-statistic: 169.8 on 5 and 1738 DF,  p-value: < 2.2e-16
# LASSO
set.seed(12345)  # make sure to set your seed when doing cross validation!
mock_select_x <- as.matrix(mock_select[, 1:6])
mock_select_y <- mock_select[, 7]
# use cross validation to pick the "best" (based on MSE) lambda
mock_select_ridge_LASSO <- cv.glmnet(x = mock_select_x,
                          y = mock_select_y, 
                          type.measure = "mse", 
                          alpha = 1)  # 1 is code for "LASSO"

# plot (log) lambda vs MSE
# autoplot(mock_select_ridge_LASSO, label = FALSE) +
#   theme_bw() +
#   theme(aspect.ratio = 1)

# lambda.min: value of lambda that gives minimum mean cross-validated error
mock_select_ridge_LASSO$lambda.min
## [1] 0.01166017
# lambda.1se: value of lambda within 1 standard error of the minimum 
# cross-validated error
mock_select_ridge_LASSO$lambda.1se
## [1] 1.772242
coef(mock_select_ridge_LASSO, s = "lambda.min")
## 7 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) -0.2518514
## machine1    11.6050102
## machine2    35.5405534
## machine3    50.6646194
## RateL2       5.0627458
## RateL3       7.6703472
## BrokenBK    -6.3145739
coef(mock_select_ridge_LASSO, s = "lambda.1se")
## 7 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 18.0134509
## machine1     .        
## machine2    20.0455649
## machine3    27.1570977
## RateL2       .        
## RateL3       .        
## BrokenBK    -0.1589097
# Elastic Net
set.seed(12345)  # make sure to set your seed when doing cross validation!
# use cross validation to pick the "best" (based on MSE) lambda
mock_select_ridge_net <- cv.glmnet(x = mock_select_x,
                          y = mock_select_y, 
                          type.measure = "mse", 
                          alpha = 0.5)  

# plot (log) lambda vs MSE
# autoplot(mock_select_ridge_net, label = FALSE) +
#   theme_bw() +
#   theme(aspect.ratio = 1)

# lambda.min: value of lambda that gives minimum mean cross-validated error
mock_select_ridge_net$lambda.min
## [1] 0.0160738
# lambda.1se: value of lambda within 1 standard error of the minimum 
# cross-validated error
mock_select_ridge_net$lambda.1se
## [1] 2.681271
coef(mock_select_ridge_net, s = "lambda.min")
## 7 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) -0.3000673
## machine1    11.5979015
## machine2    35.5277716
## machine3    50.6602989
## RateL2       5.1228120
## RateL3       7.7286651
## BrokenBK    -6.3186443
coef(mock_select_ridge_net, s = "lambda.1se")
## 7 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 18.0315155
## machine1    -0.3431857
## machine2    19.4528304
## machine3    27.7585993
## RateL2       .        
## RateL3       0.7611503
## BrokenBK    -1.7066143
Variable Best Subset Backward Sequential Replacement LASSO Elastic Net
machine1 X X X X
machine2 X X X X X
machine3 X X X X X
Rate2
Rate3 X X X X
BrokenBK X X X X X

Given the results from all of the variable selection procedures, shown in the table above, I choose to keep machine1, machine1, machine1, Rate3, and BrokenBK in our final model since four out five methods suggest these varibales are significant to the response variable.


Initial Linear Model

# base model: machine0, RateL1 +BrokenNB
mock_lm <- lm(Cases ~ machine1 + machine2 + machine3 + 
                RateL3 + BrokenBK, data = mock_select)
summary(mock_lm)
## 
## Call:
## lm(formula = Cases ~ machine1 + machine2 + machine3 + RateL3 + 
##     BrokenBK, data = mock_select)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.193  -9.934  -2.927   7.066 109.946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.988      2.321   1.718 0.085950 .  
## machine1      11.946      2.308   5.177 2.52e-07 ***
## machine2      35.932      2.431  14.783  < 2e-16 ***
## machine3      51.071      3.411  14.972  < 2e-16 ***
## RateL3         3.133      0.818   3.831 0.000132 ***
## BrokenBK      -7.007      1.627  -4.307 1.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.75 on 1738 degrees of freedom
## Multiple R-squared:  0.3282, Adjusted R-squared:  0.3263 
## F-statistic: 169.8 on 5 and 1738 DF,  p-value: < 2.2e-16
# anova test: break and machine
mock_lm_inter_BkMach <- lm(Cases ~ machine1 + machine2 + machine3 + 
                RateL3 + BrokenBK + BrokenBK:machine1+ BrokenBK:machine2
                + BrokenBK:machine3
                , data = mock_select)

anova(mock_lm, mock_lm_inter_BkMach)
## Analysis of Variance Table
## 
## Model 1: Cases ~ machine1 + machine2 + machine3 + RateL3 + BrokenBK
## Model 2: Cases ~ machine1 + machine2 + machine3 + RateL3 + BrokenBK + 
##     BrokenBK:machine1 + BrokenBK:machine2 + BrokenBK:machine3
##   Res.Df    RSS Df Sum of Sq    F Pr(>F)
## 1   1738 487832                         
## 2   1735 486856  3    976.52 1.16 0.3237

interaction between broken and machine are not significant.

# anova test: rate and machine
mock_lm_inter_BkRate <- lm(Cases ~ machine1 + machine2 + machine3  + 
                RateL3 + BrokenBK + BrokenBK:RateL3
                , data = mock_select)

anova(mock_lm, mock_lm_inter_BkRate)
## Analysis of Variance Table
## 
## Model 1: Cases ~ machine1 + machine2 + machine3 + RateL3 + BrokenBK
## Model 2: Cases ~ machine1 + machine2 + machine3 + RateL3 + BrokenBK + 
##     BrokenBK:RateL3
##   Res.Df    RSS Df Sum of Sq     F Pr(>F)
## 1   1738 487832                          
## 2   1737 487163  1    668.92 2.385 0.1227

interaction between broken and rate are not significant.

Bases on the result, I create a linear model without interaction.

# base model: machine0, RateL1 +BrokenNB
mock_lm_inter <- lm(Cases ~ machine1 + machine2 + machine3 + 
                RateL3 + BrokenBK,
                data = mock_select)

summary(mock_lm_inter)
## 
## Call:
## lm(formula = Cases ~ machine1 + machine2 + machine3 + RateL3 + 
##     BrokenBK, data = mock_select)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.193  -9.934  -2.927   7.066 109.946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.988      2.321   1.718 0.085950 .  
## machine1      11.946      2.308   5.177 2.52e-07 ***
## machine2      35.932      2.431  14.783  < 2e-16 ***
## machine3      51.071      3.411  14.972  < 2e-16 ***
## RateL3         3.133      0.818   3.831 0.000132 ***
## BrokenBK      -7.007      1.627  -4.307 1.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.75 on 1738 degrees of freedom
## Multiple R-squared:  0.3282, Adjusted R-squared:  0.3263 
## F-statistic: 169.8 on 5 and 1738 DF,  p-value: < 2.2e-16

Assumption Tests

# Linearity
# partial regression plots
avPlots(mock_lm_inter) + theme(aspect.ratio = 1)

## NULL
# residual vs fitted
autoplot(mock_lm_inter, which = 1, ncol = 1, nrow = 1) + theme_bw() + 
  theme(aspect.ratio = 1)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

The partial regression parts shows no weird trend on the blue line, and the residual vs fitted plot also has a fairly stright blue line. The linearity assumption probability mets.

On the other hand, the residual vs fitted plot does not equally distributed residuals. The homoscasticity assumption is not met.

## Nomality
# Boxplot
mock_select$residuals <- mock_lm_inter$residuals
ggplot(data = mock_select, mapping = aes(y = residuals)) + geom_boxplot() + 
  theme_bw() + theme(aspect.ratio = 1)

# Histogram
(mock_select_hist <- ggplot(data = mock_select, mapping = aes(x = residuals)) + 
  # only has x being residuals
  # when using this code for future data sets, make sure to change the binwidth: 
  geom_histogram(mapping = aes(y = ..density..), binwidth = 4) +
  # stat_function() overlays the red normal curve on the histogram
  stat_function(fun = dnorm, 
                color = "red", 
                size = 2,
                args = list(mean = mean(mock_select$residuals), 
                            sd = sd(mock_select$residuals)))  +
  theme(aspect.ratio = 1))

# QQ Plot
autoplot(mock_lm_inter, which = 2, ncol = 1, nrow = 1) + theme_bw() + 
  theme(aspect.ratio = 1)

# Shapiro-Wilk Test
shapiro.test(mock_lm_inter$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mock_lm_inter$residuals
## W = 0.91837, p-value < 2.2e-16

The Shapiro-Wilk test indicates the dat is not nomally distributed. However, box plot has data distributed fairly evenly. Histogram is approxmitely normal, and the QQ plot is also roughly follow the straight line. The nomality assumption is probability met.

# Cook's Distance
mock_select$cooksd <- cooks.distance(mock_lm_inter)

# plot Cook's distance against the observation number
ggplot(data = mock_select) + 
  geom_point(mapping = aes(x = as.numeric(rownames(mock_select)), 
                           y = cooksd)) +
  theme_bw() +
  ylab("Cook's Distance") +
  xlab("Observation Number") +
  geom_hline(mapping = aes(yintercept = 4 / length(cooksd)),
             color = "red", linetype = "dashed") +
  # scale_x_continuous(limits = c(0, 300)) +
  scale_y_continuous(limits = c(0, 0.3)) +
  theme(aspect.ratio = 1)

# print a list of potential outliers according to Cook's distance
mock_select %>% 
  mutate(rowNum = row.names(mock_select)) %>%  # save original row numbers 
  filter(cooksd > 4 / length(cooksd)) %>%  # select potential outliers
  arrange(desc(cooksd))  # order from largest Cook's distance to smallest
##    machine1 machine2 machine3 RateL2 RateL3 BrokenBK Cases residuals
## 1         0        0        1      1      0        0   151  95.94043
## 2         0        0        1      0      1        0   153  94.80708
## 3         0        0        1      1      0        0   117  61.94043
## 4         0        0        1      0      1        0   117  58.80708
## 5         0        0        1      0      1        0     8 -50.19292
## 6         0        0        1      0      1        0   108  49.80708
## 7         0        0        1      0      1        1    12 -39.18577
## 8         0        0        1      1      0        0    10 -45.05957
## 9         0        0        1      1      0        0    11 -44.05957
## 10        0        0        1      0      1        0    15 -43.19292
## 11        0        0        1      0      0        0    13 -42.05957
## 12        0        0        1      0      1        0    18 -40.19292
## 13        0        1        0      0      1        0   153 109.94591
## 14        0        1        0      0      1        0   149 105.94591
## 15        0        0        1      0      1        0    21 -37.19292
## 16        0        0        1      0      1        1    83  31.81423
## 17        0        0        1      0      1        0    95  36.80708
## 18        0        0        1      1      0        0    19 -36.05957
## 19        0        0        1      1      0        0    91  35.94043
## 20        0        0        1      0      0        0    20 -35.05957
## 21        0        1        0      1      0        0   123  83.07926
## 22        0        0        1      0      1        0    25 -33.19292
## 23        0        0        1      0      1        0    25 -33.19292
## 24        0        0        1      1      0        0    87  31.94043
## 25        0        1        0      1      0        0   117  77.07926
## 26        0        1        0      1      0        0   111  71.07926
## 27        0        1        0      0      1        0   117  73.94591
## 28        0        0        0      1      0        0    33  29.01174
## 29        0        1        0      1      0        0   104  64.07926
## 30        0        0        1      1      0        0    30 -25.05957
## 31        0        1        0      1      0        0    98  58.07926
## 32        1        0        0      1      0        1    46  37.07309
## 33        0        1        0      0      1        1     3 -33.04694
## 34        0        1        0      1      0        0    96  56.07926
## 35        0        1        0      0      0        1    64  31.08641
## 36        0        1        0      1      0        0    94  54.07926
## 37        0        1        0      0      1        0   102  58.94591
## 38        0        1        0      1      0        1     3 -29.91359
## 39        0        1        0      1      0        0    92  52.07926
## 40        0        0        1      0      1        0    79  20.80708
## 41        0        1        0      0      1        0   101  57.94591
## 42        0        1        0      0      1        0    98  54.94591
## 43        0        1        0      0      1        0    98  54.94591
## 44        0        1        0      0      0        1    59  26.08641
## 45        1        0        0      0      1        1    41  28.93974
## 46        0        1        0      0      1        0    93  49.94591
## 47        0        1        0      0      1        0    92  48.94591
## 48        0        1        0      0      1        1    11 -25.04694
## 49        0        0        1      1      0        0    38 -17.05957
## 50        1        0        0      1      0        0    84  68.06594
## 51        1        0        0      1      0        0    84  68.06594
## 52        1        0        0      0      1        0    95  75.93259
## 53        0        1        0      0      1        0    90  46.94591
## 54        0        1        0      0      1        1    12 -24.04694
## 55        0        1        0      1      0        1    56  23.08641
## 56        0        0        0      1      0        0    22  18.01174
## 57        0        1        0      1      0        0    80  40.07926
## 58        0        1        0      1      0        0    80  40.07926
## 59        0        1        0      0      1        0    88  44.94591
## 60        0        1        0      0      1        0    88  44.94591
## 61        0        1        0      0      1        1    13 -23.04694
## 62        0        0        1      0      1        0    74  15.80708
## 63        0        1        0      0      1        0    87  43.94591
## 64        0        1        0      0      1        0    87  43.94591
## 65        0        1        0      1      0        0     1 -38.92074
## 66        0        1        0      1      0        0    78  38.07926
## 67        0        1        0      1      0        0    78  38.07926
## 68        0        1        0      1      0        0     2 -37.92074
## 69        0        0        1      1      0        0    70  14.94043
## 70        0        1        0      0      1        0     1 -42.05409
## 71        1        0        0      1      0        0    75  59.06594
## 72        0        1        0      1      0        0    77  37.07926
## 73        0        1        0      1      0        0     3 -36.92074
## 74        0        1        0      1      0        0     3 -36.92074
## 75        0        1        0      0      1        0     2 -41.05409
## 76        0        1        0      0      1        0     2 -41.05409
## 77        0        1        0      0      1        0    84  40.94591
## 78        0        1        0      1      0        0     4 -35.92074
## 79        0        1        0      1      0        0     4 -35.92074
## 80        0        1        0      1      0        0     4 -35.92074
## 81        0        1        0      0      1        0    83  39.94591
## 82        0        1        0      1      0        0    75  35.07926
## 83        0        1        0      1      0        0    75  35.07926
## 84        0        1        0      0      1        0     4 -39.05409
## 85        0        1        0      0      1        0     4 -39.05409
## 86        0        1        0      0      1        0    82  38.94591
## 87        0        1        0      0      1        0     5 -38.05409
## 88        0        1        0      0      1        0     5 -38.05409
## 89        0        1        0      0      1        0     6 -37.05409
## 90        0        1        0      0      1        0    80  36.94591
## 91        0        0        1      0      1        0    45 -13.19292
## 92        1        0        0      1      0        1    30  21.07309
## 93        0        1        0      1      0        0    72  32.07926
## 94        0        1        0      1      0        0     8 -31.92074
## 95        0        1        0      1      0        0     8 -31.92074
## 96        0        0        0      1      0        0    18  14.01174
## 97        0        1        0      0      1        0     8 -35.05409
## 98        0        1        0      0      1        0     8 -35.05409
## 99        0        1        0      0      1        0    78  34.94591
##         cooksd rowNum
## 1  0.139172797   1351
## 2  0.132630482    592
## 3  0.058009547    859
## 4  0.051029467    266
## 5  0.037174639    194
## 6  0.036605299   1050
## 7  0.030734176    455
## 8  0.030699060    948
## 9  0.029351581     97
## 10 0.027528780    361
## 11 0.026747343   1678
## 12 0.023837514    152
## 13 0.022801908    761
## 14 0.021172952      9
## 15 0.020411851   1120
## 16 0.020258519   1081
## 17 0.019990540   1165
## 18 0.019660387    847
## 19 0.019530690   1039
## 20 0.018585068   1677
## 21 0.016376985   1634
## 22 0.016257462   1109
## 23 0.016257462   1416
## 24 0.015425263   1658
## 25 0.014096905   1296
## 26 0.011987662   1411
## 27 0.010314337   1699
## 28 0.009973397    714
## 29 0.009742797   1598
## 30 0.009495071    885
## 31 0.008003701   1138
## 32 0.007812021   1736
## 33 0.007780446    108
## 34 0.007461966    622
## 35 0.006995339   1388
## 36 0.006939212    438
## 37 0.006554210    823
## 38 0.006477458   1694
## 39 0.006435441    122
## 40 0.006388282   1452
## 41 0.006333716    687
## 42 0.005694869    112
## 43 0.005694869   1422
## 44 0.004926021   1134
## 45 0.004923234   1163
## 46 0.004705577   1544
## 47 0.004519036   1498
## 48 0.004469422   1212
## 49 0.004400349    735
## 50 0.004341982   1025
## 51 0.004341982   1730
## 52 0.004341962   1579
## 53 0.004157273    314
## 54 0.004119663   1726
## 55 0.003858162    331
## 56 0.003844209   1721
## 57 0.003811431    724
## 58 0.003811431    920
## 59 0.003810600    770
## 60 0.003810600   1597
## 61 0.003784152    691
## 62 0.003686931    261
## 63 0.003642922    462
## 64 0.003642922   1143
## 65 0.003594272    137
## 66 0.003440533    723
## 67 0.003440533   1477
## 68 0.003411948   1568
## 69 0.003375027    727
## 70 0.003336027   1447
## 71 0.003269659    159
## 72 0.003262202    105
## 73 0.003234369    181
## 74 0.003234369   1470
## 75 0.003179260    603
## 76 0.003179260    651
## 77 0.003162525    593
## 78 0.003061535   1048
## 79 0.003061535   1378
## 80 0.003061535   1687
## 81 0.003009938      3
## 82 0.002919776    560
## 83 0.002919776   1118
## 84 0.002877042   1683
## 85 0.002877042   1708
## 86 0.002861124    170
## 87 0.002731592    655
## 88 0.002731592   1128
## 89 0.002589915    982
## 90 0.002574813   1319
## 91 0.002568288    969
## 92 0.002524075   1371
## 93 0.002441728   1311
## 94 0.002417657    359
## 95 0.002417657    454
## 96 0.002326375    791
## 97 0.002317878    475
## 98 0.002317878    536
## 99 0.002303593   1580
# calculate the DFFITS
mock_select$dffits <- dffits(mock_lm_inter)
# plot the DFFITS against the observation number
ggplot(data = mock_select) + 
  geom_point(mapping = aes(x = as.numeric(rownames(mock_select)), 
                           y = abs(dffits))) +
  theme_bw() +
  ylab("Absolute Value of DFFITS for Y") +
  xlab("Observation Number") +
  # for n > 30
  geom_hline(mapping = aes(yintercept = 2 * sqrt(length(mock_lm_inter$coefficients) /
                                                   length(dffits))),
             color = "red", linetype = "dashed") +
  # for n <= 30 (code for future, small data sets)
  # geom_hline(mapping = aes(yintercept = 1),
  #            color = "red", linetype = "dashed") +
  scale_y_continuous(limits = c(0, 0.8)) +
  # scale_y_continuous(limits = c(0, 1.1)) +
  theme(aspect.ratio = 1)
## Warning: Removed 2 rows containing missing values (geom_point).

# print a list of potential influential points according to DFFITS
# for n > 30
mock_select %>% 
  mutate(rowNum = row.names(mock_select)) %>%  # save original row numbers 
  # select potential influential pts
  filter(abs(dffits) > 2 * sqrt(length(mock_lm_inter$coefficients) / 
                                  length(dffits))) %>%
  arrange(desc(abs(dffits)))  # order from largest DFFITS to smallest
##     machine1 machine2 machine3 RateL2 RateL3 BrokenBK Cases residuals
## 1          0        0        1      1      0        0   151  95.94043
## 2          0        0        1      0      1        0   153  94.80708
## 3          0        0        1      1      0        0   117  61.94043
## 4          0        0        1      0      1        0   117  58.80708
## 5          0        0        1      0      1        0     8 -50.19292
## 6          0        0        1      0      1        0   108  49.80708
## 7          0        0        1      0      1        1    12 -39.18577
## 8          0        0        1      1      0        0    10 -45.05957
## 9          0        0        1      1      0        0    11 -44.05957
## 10         0        0        1      0      1        0    15 -43.19292
## 11         0        0        1      0      0        0    13 -42.05957
## 12         0        0        1      0      1        0    18 -40.19292
## 13         0        1        0      0      1        0   153 109.94591
## 14         0        1        0      0      1        0   149 105.94591
## 15         0        0        1      0      1        0    21 -37.19292
## 16         0        0        1      0      1        1    83  31.81423
## 17         0        0        1      0      1        0    95  36.80708
## 18         0        0        1      1      0        0    19 -36.05957
## 19         0        0        1      1      0        0    91  35.94043
## 20         0        0        1      0      0        0    20 -35.05957
## 21         0        1        0      1      0        0   123  83.07926
## 22         0        0        1      0      1        0    25 -33.19292
## 23         0        0        1      0      1        0    25 -33.19292
## 24         0        0        1      1      0        0    87  31.94043
## 25         0        1        0      1      0        0   117  77.07926
## 26         0        1        0      1      0        0   111  71.07926
## 27         0        1        0      0      1        0   117  73.94591
## 28         0        0        0      1      0        0    33  29.01174
## 29         0        1        0      1      0        0   104  64.07926
## 30         0        0        1      1      0        0    30 -25.05957
## 31         0        1        0      1      0        0    98  58.07926
## 32         1        0        0      1      0        1    46  37.07309
## 33         0        1        0      0      1        1     3 -33.04694
## 34         0        1        0      1      0        0    96  56.07926
## 35         0        1        0      0      0        1    64  31.08641
## 36         0        1        0      1      0        0    94  54.07926
## 37         0        1        0      0      1        0   102  58.94591
## 38         0        1        0      1      0        1     3 -29.91359
## 39         0        1        0      1      0        0    92  52.07926
## 40         0        0        1      0      1        0    79  20.80708
## 41         0        1        0      0      1        0   101  57.94591
## 42         0        1        0      0      1        0    98  54.94591
## 43         0        1        0      0      1        0    98  54.94591
## 44         0        1        0      0      0        1    59  26.08641
## 45         1        0        0      0      1        1    41  28.93974
## 46         0        1        0      0      1        0    93  49.94591
## 47         0        1        0      0      1        0    92  48.94591
## 48         0        1        0      0      1        1    11 -25.04694
## 49         0        0        1      1      0        0    38 -17.05957
## 50         1        0        0      0      1        0    95  75.93259
## 51         1        0        0      1      0        0    84  68.06594
## 52         1        0        0      1      0        0    84  68.06594
## 53         0        1        0      0      1        0    90  46.94591
## 54         0        1        0      0      1        1    12 -24.04694
## 55         0        1        0      1      0        1    56  23.08641
## 56         0        0        0      1      0        0    22  18.01174
## 57         0        1        0      0      1        0    88  44.94591
## 58         0        1        0      0      1        0    88  44.94591
## 59         0        1        0      1      0        0    80  40.07926
## 60         0        1        0      1      0        0    80  40.07926
## 61         0        1        0      0      1        1    13 -23.04694
## 62         0        0        1      0      1        0    74  15.80708
## 63         0        1        0      0      1        0    87  43.94591
## 64         0        1        0      0      1        0    87  43.94591
## 65         0        1        0      1      0        0     1 -38.92074
## 66         0        1        0      1      0        0    78  38.07926
## 67         0        1        0      1      0        0    78  38.07926
## 68         0        1        0      1      0        0     2 -37.92074
## 69         0        0        1      1      0        0    70  14.94043
## 70         0        1        0      0      1        0     1 -42.05409
## 71         1        0        0      1      0        0    75  59.06594
## 72         0        1        0      1      0        0    77  37.07926
## 73         0        1        0      1      0        0     3 -36.92074
## 74         0        1        0      1      0        0     3 -36.92074
## 75         0        1        0      0      1        0     2 -41.05409
## 76         0        1        0      0      1        0     2 -41.05409
## 77         0        1        0      0      1        0    84  40.94591
## 78         0        1        0      1      0        0     4 -35.92074
## 79         0        1        0      1      0        0     4 -35.92074
## 80         0        1        0      1      0        0     4 -35.92074
## 81         0        1        0      0      1        0    83  39.94591
## 82         0        1        0      1      0        0    75  35.07926
## 83         0        1        0      1      0        0    75  35.07926
## 84         0        1        0      0      1        0     4 -39.05409
## 85         0        1        0      0      1        0     4 -39.05409
## 86         0        1        0      0      1        0    82  38.94591
## 87         0        1        0      0      1        0     5 -38.05409
## 88         0        1        0      0      1        0     5 -38.05409
## 89         0        1        0      0      1        0     6 -37.05409
## 90         0        1        0      0      1        0    80  36.94591
## 91         0        0        1      0      1        0    45 -13.19292
## 92         1        0        0      1      0        1    30  21.07309
## 93         0        1        0      1      0        0    72  32.07926
## 94         0        1        0      1      0        0     8 -31.92074
## 95         0        1        0      1      0        0     8 -31.92074
## 96         0        0        0      1      0        0    18  14.01174
## 97         0        1        0      0      1        0     8 -35.05409
## 98         0        1        0      0      1        0     8 -35.05409
## 99         0        1        0      0      1        0    78  34.94591
## 100        0        1        0      1      0        0    71  31.07926
##          cooksd     dffits rowNum
## 1   0.139172797  0.9225034   1351
## 2   0.132630482  0.9003462    592
## 3   0.058009547  0.5921854    859
## 4   0.051029467  0.5551926    266
## 5   0.037174639 -0.4733972    194
## 6   0.036605299  0.4697390   1050
## 7   0.030734176 -0.4299999    455
## 8   0.030699060 -0.4299732    948
## 9   0.029351581 -0.4203914     97
## 10  0.027528780 -0.4070956    361
## 11  0.026747343 -0.4012356   1678
## 12  0.023837514 -0.3787206    152
## 13  0.022801908  0.3744572    761
## 14  0.021172952  0.3605058      9
## 15  0.020411851 -0.3503672   1120
## 16  0.020258519  0.3489155   1081
## 17  0.019990540  0.3467220   1165
## 18  0.019660387 -0.3438275    847
## 19  0.019530690  0.3426885   1039
## 20  0.018585068 -0.3342675   1677
## 21  0.016376985  0.3156271   1634
## 22  0.016257462 -0.3125935   1109
## 23  0.016257462 -0.3125935   1416
## 24  0.015425263  0.3044618   1658
## 25  0.014096905  0.2925392   1296
## 26  0.011987662  0.2695179   1411
## 27  0.010314337  0.2501076   1699
## 28  0.009973397  0.2447679    714
## 29  0.009742797  0.2427366   1598
## 30  0.009495071 -0.2387737    885
## 31  0.008003701  0.2198411   1138
## 32  0.007812021  0.2167459   1736
## 33  0.007780446 -0.2162446    108
## 34  0.007461966  0.2122205    622
## 35  0.006995339  0.2050174   1388
## 36  0.006939212  0.2046052    438
## 37  0.006554210  0.1989610    823
## 38  0.006477458 -0.1972679   1694
## 39  0.006435441  0.1969950    122
## 40  0.006388282  0.1958124   1452
## 41  0.006333716  0.1955620    687
## 42  0.005694869  0.1853723    112
## 43  0.005694869  0.1853723   1422
## 44  0.004926021  0.1719909   1134
## 45  0.004923234  0.1719700   1163
## 46  0.004705577  0.1684123   1544
## 47  0.004519036  0.1650236   1498
## 48  0.004469422 -0.1638170   1212
## 49  0.004400349 -0.1624901    735
## 50  0.004341962  0.1623225   1579
## 51  0.004341982  0.1621325   1025
## 52  0.004341982  0.1621325   1730
## 53  0.004157273  0.1582491    314
## 54  0.004119663 -0.1572686   1726
## 55  0.003858162  0.1521882    331
## 56  0.003844209  0.1518803   1721
## 57  0.003810600  0.1514786    770
## 58  0.003810600  0.1514786   1597
## 59  0.003811431  0.1514306    724
## 60  0.003811431  0.1514306    920
## 61  0.003784152 -0.1507212    691
## 62  0.003686931  0.1487295    261
## 63  0.003642922  0.1480948    462
## 64  0.003642922  0.1480948   1143
## 65  0.003594272 -0.1470395    137
## 66  0.003440533  0.1438509    723
## 67  0.003440533  0.1438509   1477
## 68  0.003411948 -0.1432503   1568
## 69  0.003375027  0.1422955    727
## 70  0.003336027 -0.1416957   1447
## 71  0.003269659  0.1405280    159
## 72  0.003262202  0.1400623    105
## 73  0.003234369 -0.1394619    181
## 74  0.003234369 -0.1394619   1470
## 75  0.003179260 -0.1383145    603
## 76  0.003179260 -0.1383145    651
## 77  0.003162525  0.1379487    593
## 78  0.003061535 -0.1356743   1048
## 79  0.003061535 -0.1356743   1378
## 80  0.003061535 -0.1356743   1687
## 81  0.003009938  0.1345684      3
## 82  0.002919776  0.1324878    560
## 83  0.002919776  0.1324878   1118
## 84  0.002877042 -0.1315546   1683
## 85  0.002877042 -0.1315546   1708
## 86  0.002861124  0.1311890    170
## 87  0.002731592 -0.1281758    655
## 88  0.002731592 -0.1281758   1128
## 89  0.002589915 -0.1247979    982
## 90  0.002574813  0.1244325   1319
## 91  0.002568288 -0.1241229    969
## 92  0.002524075  0.1230840   1371
## 93  0.002441728  0.1211322   1311
## 94  0.002417657 -0.1205324    359
## 95  0.002417657 -0.1205324    454
## 96  0.002326375  0.1181353    791
## 97  0.002317878 -0.1180444    475
## 98  0.002317878 -0.1180444    536
## 99  0.002303593  0.1176791   1580
## 100 0.002291870  0.1173486   1702

Although DFFITS plot does not show dots that are too far away from the rest of the points. There are two dots that might be outliers in the Cook’s distance plot. The model describes all observations assumption might not be met.

After fitting the linear regression model and checking the assumptions, I notice several assumptions may not be met. Specifically, the residual vs fitted plot does not equally distributed residuals. The homoscasticity assumption is not met. And describes all observations assumption might not be met.


Trying Several Linear Models

Since homoscedasticity was likely not met, I apply a Box-Cox transform to help us determine which transformation to use when transforming Y.

# close to 0. transform log
bc <- boxCox(mock_lm_inter, family="yjPower", plotit = TRUE)

bc$x[which.max(bc$y)]
## [1] 0.1818182

The Box-Cox has a value of 0.1818182, which suggests to transform suggested log transform on y and this helped with the homoscedasticity assumption.

# create trans linear model
mock_select$Log_Cases <- log(mock_select$Cases)
mock_lm_inter_trans <- lm(Log_Cases ~ machine1 + machine2 + machine3 + 
                RateL3 + BrokenBK
                , data = mock_select)
summary(mock_lm_inter_trans)
## 
## Call:
## lm(formula = Log_Cases ~ machine1 + machine2 + machine3 + RateL3 + 
##     BrokenBK, data = mock_select)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5681 -0.4844  0.1144  0.5764  2.2109 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.28561    0.11270  11.408  < 2e-16 ***
## machine1     1.16493    0.11203  10.399  < 2e-16 ***
## machine2     2.05135    0.11801  17.383  < 2e-16 ***
## machine3     2.39578    0.16560  14.467  < 2e-16 ***
## RateL3       0.23112    0.03971   5.820 7.00e-09 ***
## BrokenBK    -0.35038    0.07899  -4.435 9.76e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8134 on 1738 degrees of freedom
## Multiple R-squared:  0.2637, Adjusted R-squared:  0.2616 
## F-statistic: 124.5 on 5 and 1738 DF,  p-value: < 2.2e-16
# create predicting value for log y
machine1_values <- mock_select$machine1
machine2_values <- mock_select$machine2
machine3_values <- mock_select$machine3
RateL3_values <- mock_select$RateL3
BrokenBK_values <- mock_select$BrokenBK

log_Cases_preds <- predict(mock_lm_inter_trans, 
                         newdata = data.frame(machine1 = machine1_values,
                                              machine2 = machine2_values,
                                              machine3 = machine3_values,
                                              RateL3 = RateL3_values,
                                              BrokenBK = BrokenBK_values))


Cases_preds <- exp(log_Cases_preds)  ## use exp to "UNDO" the log transform

# Store results in a data frame for plotting
preds <- data.frame("machine1_values" = machine1_values,
                    "machine2_values" = machine2_values, 
                    "machine3_values" = machine3_values, 
                    "RateL3" = RateL3_values,
                    "BrokenBK" = BrokenBK_values,
                    "log_Cases_preds" = log_Cases_preds)
## Linearity, Homoscasticity
# partial regression plots
avPlots(mock_lm_inter_trans) + theme(aspect.ratio = 1)

## NULL
# residual vs fitted
autoplot(mock_lm_inter_trans, which = 1, ncol = 1, nrow = 1) + theme_bw() + 
  theme(aspect.ratio = 1)

Both partial regrssion plot and residuals vs fitted plot show no weird curve on the blue line. The linearity assumption is probability met.

The residual vs fitted plot has equally spread residuals aound a horizontal line with not distinct pattern, which indicates the homoscasticity assumption is met.

# Boxplot
mock_select$residuals <- mock_lm_inter_trans$residuals
ggplot(data = mock_select, mapping = aes(y = residuals)) + geom_boxplot() + 
  theme_bw() + theme(aspect.ratio = 1)

# Histogram
(mock_select_hist <- ggplot(data = mock_select, mapping = aes(x = residuals)) + 
  # only has x being residuals
  # when using this code for future data sets, make sure to change the binwidth: 
  geom_histogram(mapping = aes(y = ..density..)) +
  # stat_function() overlays the red normal curve on the histogram
  stat_function(fun = dnorm, 
                color = "red", 
                size = 2,
                args = list(mean = mean(mock_select$residuals), 
                            sd = sd(mock_select$residuals)))  +
  theme(aspect.ratio = 1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# QQ Plot
autoplot(mock_lm_inter_trans, which = 2, ncol = 1, nrow = 1) + theme_bw() + 
  theme(aspect.ratio = 1)

# Shapiro-Wilk Test
shapiro.test(mock_lm_inter_trans$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mock_lm_inter_trans$residuals
## W = 0.96838, p-value < 2.2e-16

The Shapiro-Wilk test indicates the dat is not nomally distributed. However, box plot has data distributed fairly evenly. Histogram is approxmitely normal, and the QQ plot is also roughly follow the straight line. The nomality assumption is probability met.

# Cook's Distance
mock_select$cooksd <- cooks.distance(mock_lm_inter_trans)

# plot Cook's distance against the observation number
ggplot(data = mock_select) + 
  geom_point(mapping = aes(x = as.numeric(rownames(mock_select)), 
                           y = cooksd)) +
  theme_bw() +
  ylab("Cook's Distance") +
  xlab("Observation Number") +
  geom_hline(mapping = aes(yintercept = 4 / length(cooksd)),
             color = "red", linetype = "dashed") +
  # scale_x_continuous(limits = c(0, 300)) +
  scale_y_continuous(limits = c(0, 0.08)) +
  theme(aspect.ratio = 1)

# print a list of potential outliers according to Cook's distance
mock_select %>% 
  mutate(rowNum = row.names(mock_select)) %>%  # save original row numbers 
  filter(cooksd > 4 / length(cooksd)) %>%  # select potential outliers
  arrange(desc(cooksd))  # order from largest Cook's distance to smallest
##    machine1 machine2 machine3 RateL2 RateL3 BrokenBK Cases  residuals
## 1         0        0        0      1      0        0    33  2.2109014
## 2         0        0        1      0      1        0     8 -1.8330642
## 3         0        0        0      1      0        0    22  1.8054363
## 4         0        1        0      0      1        1     3 -2.1190840
## 5         1        0        0      0      1        1     1 -2.3312761
## 6         0        0        0      1      0        0    18  1.6047656
## 7         0        0        1      1      0        0    10 -1.3787998
## 8         0        0        1      1      0        0   151  1.3358950
## 9         0        1        0      1      0        0     1 -3.3369544
## 10        0        0        0      0      1        0     1 -1.5167270
## 11        0        0        0      0      1        0     1 -1.5167270
## 12        0        1        0      1      0        1     3 -1.8879631
## 13        0        0        1      1      0        0    11 -1.2834896
## 14        0        1        0      0      1        0     1 -3.5680753
## 15        0        0        1      0      1        1    12 -1.0772201
## 16        0        0        0      0      0        1     8  1.1442144
## 17        0        0        1      0      1        0    15 -1.2044556
## 18        0        0        0      1      0        0     1 -1.2856061
## 19        0        0        1      0      0        0    13 -1.1164355
## 20        0        0        1      0      1        0   153  1.1179321
## 21        0        0        1      1      0        0   117  1.0807891
## 22        1        0        0      1      0        1    46  1.7284862
## 23        0        1        0      1      0        0     2 -2.6438072
## 24        0        0        0      0      1        0    15  1.1913232
## 25        1        0        0      0      1        1     2 -1.6381289
## 26        1        0        0      0      1        1     2 -1.6381289
## 27        0        1        0      0      1        0     2 -2.8749281
## 28        0        1        0      0      1        0     2 -2.8749281
## 29        0        0        1      0      1        0    18 -1.0221340
## 30        0        0        1      0      1        1    83  0.8567138
## 31        0        0        0      1      0        1     6  0.8565324
## 32        0        0        0      1      0        0    10  1.0169790
## 33        0        0        0      1      0        0    10  1.0169790
## 34        0        1        0      1      0        0     3 -2.2383421
## 35        0        1        0      1      0        0     3 -2.2383421
## 36        1        0        0      1      0        1     2 -1.4070080
## 37        1        0        0      1      0        1     2 -1.4070080
## 38        1        0        0      0      1        1    41  1.3822960
## 39        0        0        1      0      1        0    21 -0.8679834
## 40        0        0        0      0      1        0    12  0.9681796
## 41        0        0        1      0      1        0   117  0.8496681
## 42        0        0        1      1      0        0    91  0.8294746
## 43        0        1        0      0      0        1    64  1.1723077
## 44        1        0        0      1      0        1    30  1.3010422
## 45        0        0        1      1      0        0    87  0.7845232
## 46        0        1        0      1      0        0     4 -1.9506600
## 47        0        1        0      1      0        0     4 -1.9506600
## 48        0        1        0      1      0        0     4 -1.9506600
## 49        0        1        0      0      1        0     4 -2.1817809
## 50        0        1        0      0      1        0     4 -2.1817809
## 51        1        0        0      0      1        1     3 -1.2326638
## 52        1        0        0      0      1        1     3 -1.2326638
## 53        0        0        1      0      1        0   108  0.7696254
## 54        0        1        0      0      0        1    59  1.0909621
## 55        0        0        1      1      0        0    19 -0.7369459
## 56        0        1        0      1      0        1    56  1.0387763
## 57        0        0        0      0      1        0     2 -0.8235799
## 58        0        0        0      0      1        0     2 -0.8235799
## 59        0        0        0      0      1        0     2 -0.8235799
## 60        0        0        0      0      1        0     2 -0.8235799
## 61        0        0        0      0      1        0     2 -0.8235799
## 62        0        0        0      0      1        0     2 -0.8235799
## 63        0        0        0      0      1        0     2 -0.8235799
## 64        0        0        0      0      1        0     2 -0.8235799
## 65        0        0        0      0      1        0     2 -0.8235799
## 66        0        0        0      0      1        0     2 -0.8235799
## 67        0        0        0      0      1        0     2 -0.8235799
## 68        1        0        0      1      0        1    26  1.1579413
## 69        0        0        0      1      0        0     8  0.7938354
## 70        0        1        0      0      1        0     5 -1.9586374
## 71        0        1        0      0      1        0     5 -1.9586374
## 72        0        0        1      0      0        0    20 -0.6856526
## 73        0        0        1      0      1        0    25 -0.6936300
## 74        0        0        1      0      1        0    25 -0.6936300
## 75        1        0        0      0      0        1    23  1.0353390
## 76        0        0        1      0      1        0    95  0.6413711
## 77        0        1        0      0      1        0     6 -1.7763158
## 78        1        0        0      1      0        1     3 -1.0015429
## 79        1        0        0      1      0        1     3 -1.0015429
## 80        1        0        0      1      0        0     1 -2.4505342
## 81        1        0        0      1      0        0     1 -2.4505342
## 82        1        0        0      1      0        0     1 -2.4505342
## 83        1        0        0      1      0        0     1 -2.4505342
## 84        1        0        0      0      1        1    27  0.9645608
## 85        1        0        0      0      1        0     1 -2.6816551
## 86        1        0        0      0      1        0     1 -2.6816551
## 87        1        0        0      0      1        0     1 -2.6816551
## 88        1        0        0      0      1        0     1 -2.6816551
## 89        1        0        0      0      1        0     1 -2.6816551
##         cooksd      dffits Log_Cases rowNum
## 1  0.024573483  0.24476793 3.4965076    714
## 2  0.021035348 -0.47339722 2.0794415    194
## 3  0.016386730  0.15188029 3.0910425   1721
## 4  0.013572843 -0.21624464 1.0986123    108
## 5  0.013554427 -0.06567523 0.0000000    179
## 6  0.012946464  0.11813529 2.8903718    791
## 7  0.012195116 -0.42997321 2.3025851    948
## 8  0.011447961  0.92250336 5.0172798   1351
## 9  0.011209384 -0.14703954 0.0000000    137
## 10 0.011137215 -0.05064049 0.0000000    467
## 11 0.011137215 -0.05064049 0.0000000   1541
## 12 0.010946797 -0.19726791 1.0986123   1694
## 13 0.010567402 -0.42039136 2.3978953     97
## 14 0.010188578 -0.14169568 0.0000000   1447
## 15 0.009853857 -0.42999994 2.4849066    455
## 16 0.009480895  0.11149240 2.0794415   1395
## 17 0.009081883 -0.40709558 2.7080502    361
## 18 0.008308906 -0.02518954 0.0000000    211
## 19 0.007995596 -0.40123561 2.5649494   1678
## 20 0.007823934  0.90034618 5.0304379    592
## 21 0.007493167  0.59218539 4.7621739    859
## 22 0.007204603  0.21674586 3.8286414   1736
## 23 0.007036242 -0.14325026 0.6931472   1568
## 24 0.006871015  0.06517498 2.7080502   1271
## 25 0.006692523 -0.05973597 0.6931472    126
## 26 0.006692523 -0.05973597 0.6931472    522
## 27 0.006614538 -0.13831445 0.6931472    603
## 28 0.006614538 -0.13831445 0.6931472    651
## 29 0.006540486 -0.37872059 2.8903718    152
## 30 0.006232595  0.34891548 4.4188406   1081
## 31 0.005312783  0.09125197 1.7917595   1274
## 32 0.005199384  0.05067750 2.3025851    488
## 33 0.005199384  0.05067750 2.3025851    789
## 34 0.005043525 -0.13946186 1.0986123    181
## 35 0.005043525 -0.13946186 1.0986123   1470
## 36 0.004773879 -0.04044217 0.6931472   1058
## 37 0.004773879 -0.04044217 0.6931472   1127
## 38 0.004765361  0.17196999 3.7135721   1163
## 39 0.004716470 -0.35036721 3.0445224   1120
## 40 0.004538095  0.04035549 2.4849066    805
## 41 0.004519527  0.55519255 4.7621739    266
## 42 0.004413567  0.34268848 4.5108595   1039
## 43 0.004220689  0.20501744 4.1588831   1388
## 44 0.004081887  0.12308395 3.4011974   1371
## 45 0.003948164  0.30446182 4.4659081   1658
## 46 0.003830402 -0.13567433 1.3862944   1048
## 47 0.003830402 -0.13567433 1.3862944   1378
## 48 0.003830402 -0.13567433 1.3862944   1687
## 49 0.003809499 -0.13155455 1.3862944   1683
## 50 0.003809499 -0.13155455 1.3862944   1708
## 51 0.003789510 -0.05379709 1.0986123     75
## 52 0.003789510 -0.05379709 1.0986123    520
## 53 0.003708115  0.46973900 4.6821312   1050
## 54 0.003655270  0.17199093 4.0775374   1134
## 55 0.003483812 -0.34382755 2.9444390    847
## 56 0.003313937  0.15218824 4.0253517    331
## 57 0.003283773 -0.04236757 0.6931472    469
## 58 0.003283773 -0.04236757 0.6931472    550
## 59 0.003283773 -0.04236757 0.6931472   1278
## 60 0.003283773 -0.04236757 0.6931472   1279
## 61 0.003283773 -0.04236757 0.6931472   1282
## 62 0.003283773 -0.04236757 0.6931472   1286
## 63 0.003283773 -0.04236757 0.6931472   1314
## 64 0.003283773 -0.04236757 0.6931472   1472
## 65 0.003283773 -0.04236757 0.6931472   1478
## 66 0.003283773 -0.04236757 0.6931472   1615
## 67 0.003283773 -0.04236757 0.6931472   1616
## 68 0.003233340  0.09970495 3.2580965   1417
## 69 0.003168027  0.03381729 2.0794415   1275
## 70 0.003070108 -0.12817584 1.6094379    655
## 71 0.003070108 -0.12817584 1.6094379   1128
## 72 0.003015725 -0.33426752 2.9957323   1677
## 73 0.003011966 -0.31259349 3.2188758   1109
## 74 0.003011966 -0.31259349 3.2188758   1416
## 75 0.002584898  0.08217734 3.1354942   1010
## 76 0.002575213  0.34672205 4.5538769   1165
## 77 0.002525143 -0.12479792 1.7917595    982
## 78 0.002418897 -0.03460329 1.0986123     68
## 79 0.002418897 -0.03460329 1.0986123    383
## 80 0.002387714 -0.03541132 0.0000000    203
## 81 0.002387714 -0.03541132 0.0000000    343
## 82 0.002387714 -0.03541132 0.0000000    377
## 83 0.002387714 -0.03541132 0.0000000    381
## 84 0.002320347  0.08872064 3.2958369   1131
## 85 0.002297566 -0.03840670 0.0000000    320
## 86 0.002297566 -0.03840670 0.0000000    333
## 87 0.002297566 -0.03840670 0.0000000    399
## 88 0.002297566 -0.03840670 0.0000000   1334
## 89 0.002297566 -0.03840670 0.0000000   1474
# calculate the DFFITS
mock_select$dffits <- dffits(mock_lm_inter_trans)
# plot the DFFITS against the observation number
ggplot(data = mock_select) + 
  geom_point(mapping = aes(x = as.numeric(rownames(mock_select)), 
                           y = abs(dffits))) +
  theme_bw() +
  ylab("Absolute Value of DFFITS for Y") +
  xlab("Observation Number") +
  # for n > 30
  geom_hline(mapping = aes(yintercept = 2 * sqrt(length(mock_lm_inter_trans$coefficients) /
                                                   length(dffits))),
             color = "red", linetype = "dashed") +
  # for n <= 30 (code for future, small data sets)
  # geom_hline(mapping = aes(yintercept = 1),
  #            color = "red", linetype = "dashed") +
  scale_y_continuous(limits = c(0, 0.8)) +
  # scale_y_continuous(limits = c(0, 1.1)) +
  theme(aspect.ratio = 1)

# print a list of potential influential points according to DFFITS
# for n > 30
mock_select %>% 
  mutate(rowNum = row.names(mock_select)) %>%  # save original row numbers 
  # select potential influential pts
  filter(abs(dffits) > 2 * sqrt(length(mock_lm_inter_trans$coefficients) / 
                                  length(dffits))) %>%
  arrange(desc(abs(dffits)))  # order from largest DFFITS to smallest
##    machine1 machine2 machine3 RateL2 RateL3 BrokenBK Cases  residuals
## 1         0        0        0      1      0        0    33  2.2109014
## 2         0        0        1      0      1        0     8 -1.8330642
## 3         0        0        0      1      0        0    22  1.8054363
## 4         0        1        0      0      1        1     3 -2.1190840
## 5         1        0        0      0      1        1     1 -2.3312761
## 6         0        0        0      1      0        0    18  1.6047656
## 7         0        0        1      1      0        0    10 -1.3787998
## 8         0        0        1      1      0        0   151  1.3358950
## 9         0        1        0      1      0        0     1 -3.3369544
## 10        0        0        0      0      1        0     1 -1.5167270
## 11        0        0        0      0      1        0     1 -1.5167270
## 12        0        1        0      1      0        1     3 -1.8879631
## 13        0        0        1      1      0        0    11 -1.2834896
## 14        0        1        0      0      1        0     1 -3.5680753
## 15        0        0        1      0      1        1    12 -1.0772201
## 16        0        0        0      0      0        1     8  1.1442144
## 17        0        0        1      0      1        0    15 -1.2044556
## 18        0        0        0      1      0        0     1 -1.2856061
## 19        0        0        1      0      0        0    13 -1.1164355
## 20        0        0        1      0      1        0   153  1.1179321
## 21        0        0        1      1      0        0   117  1.0807891
## 22        1        0        0      1      0        1    46  1.7284862
## 23        0        1        0      1      0        0     2 -2.6438072
## 24        0        0        0      0      1        0    15  1.1913232
## 25        1        0        0      0      1        1     2 -1.6381289
## 26        1        0        0      0      1        1     2 -1.6381289
## 27        0        1        0      0      1        0     2 -2.8749281
## 28        0        1        0      0      1        0     2 -2.8749281
## 29        0        0        1      0      1        0    18 -1.0221340
## 30        0        0        1      0      1        1    83  0.8567138
## 31        0        0        0      1      0        1     6  0.8565324
## 32        0        0        0      1      0        0    10  1.0169790
## 33        0        0        0      1      0        0    10  1.0169790
## 34        0        1        0      1      0        0     3 -2.2383421
## 35        0        1        0      1      0        0     3 -2.2383421
## 36        1        0        0      1      0        1     2 -1.4070080
## 37        1        0        0      1      0        1     2 -1.4070080
## 38        1        0        0      0      1        1    41  1.3822960
## 39        0        0        1      0      1        0    21 -0.8679834
## 40        0        0        0      0      1        0    12  0.9681796
## 41        0        0        1      0      1        0   117  0.8496681
## 42        0        0        1      1      0        0    91  0.8294746
## 43        0        1        0      0      0        1    64  1.1723077
## 44        1        0        0      1      0        1    30  1.3010422
## 45        0        0        1      1      0        0    87  0.7845232
## 46        0        1        0      1      0        0     4 -1.9506600
## 47        0        1        0      1      0        0     4 -1.9506600
## 48        0        1        0      1      0        0     4 -1.9506600
## 49        0        1        0      0      1        0     4 -2.1817809
## 50        0        1        0      0      1        0     4 -2.1817809
## 51        1        0        0      0      1        1     3 -1.2326638
## 52        1        0        0      0      1        1     3 -1.2326638
## 53        0        0        1      0      1        0   108  0.7696254
## 54        0        1        0      0      0        1    59  1.0909621
## 55        0        0        1      1      0        0    19 -0.7369459
## 56        0        1        0      1      0        1    56  1.0387763
## 57        0        0        0      0      1        0     2 -0.8235799
## 58        0        0        0      0      1        0     2 -0.8235799
## 59        0        0        0      0      1        0     2 -0.8235799
## 60        0        0        0      0      1        0     2 -0.8235799
## 61        0        0        0      0      1        0     2 -0.8235799
## 62        0        0        0      0      1        0     2 -0.8235799
## 63        0        0        0      0      1        0     2 -0.8235799
## 64        0        0        0      0      1        0     2 -0.8235799
## 65        0        0        0      0      1        0     2 -0.8235799
## 66        0        0        0      0      1        0     2 -0.8235799
## 67        0        0        0      0      1        0     2 -0.8235799
## 68        1        0        0      1      0        1    26  1.1579413
## 69        0        0        0      1      0        0     8  0.7938354
## 70        0        1        0      0      1        0     5 -1.9586374
## 71        0        1        0      0      1        0     5 -1.9586374
## 72        0        0        1      0      0        0    20 -0.6856526
## 73        0        0        1      0      1        0    25 -0.6936300
## 74        0        0        1      0      1        0    25 -0.6936300
## 75        1        0        0      0      0        1    23  1.0353390
## 76        0        0        1      0      1        0    95  0.6413711
## 77        0        1        0      0      1        0     6 -1.7763158
## 78        1        0        0      1      0        1     3 -1.0015429
## 79        1        0        0      1      0        1     3 -1.0015429
## 80        1        0        0      1      0        0     1 -2.4505342
## 81        1        0        0      1      0        0     1 -2.4505342
## 82        1        0        0      1      0        0     1 -2.4505342
## 83        1        0        0      1      0        0     1 -2.4505342
## 84        1        0        0      0      1        1    27  0.9645608
## 85        1        0        0      0      1        0     1 -2.6816551
## 86        1        0        0      0      1        0     1 -2.6816551
## 87        1        0        0      0      1        0     1 -2.6816551
## 88        1        0        0      0      1        0     1 -2.6816551
## 89        1        0        0      0      1        0     1 -2.6816551
##         cooksd     dffits Log_Cases rowNum
## 1  0.024573483  0.3847045 3.4965076    714
## 2  0.021035348 -0.3556939 2.0794415    194
## 3  0.016386730  0.3139246 3.0910425   1721
## 4  0.013572843 -0.2858550 1.0986123    108
## 5  0.013554427 -0.2857789 0.0000000    179
## 6  0.012946464  0.2789476 2.8903718    791
## 7  0.012195116 -0.2706524 2.3025851    948
## 8  0.011447961  0.2622167 5.0172798   1351
## 9  0.011209384 -0.2605332 0.0000000    137
## 10 0.011137215 -0.2586914 0.0000000    467
## 11 0.011137215 -0.2586914 0.0000000   1541
## 12 0.010946797 -0.2566117 1.0986123   1694
## 13 0.010567402 -0.2519149 2.3978953     97
## 14 0.010188578 -0.2485609 0.0000000   1447
## 15 0.009853857 -0.2432093 2.4849066    455
## 16 0.009480895  0.2385776 2.0794415   1395
## 17 0.009081883 -0.2335175 2.7080502    361
## 18 0.008308906 -0.2233783 0.0000000    211
## 19 0.007995596 -0.2190874 2.5649494   1678
## 20 0.007823934  0.2167231 5.0304379    592
## 21 0.007493167  0.2120848 4.7621739    859
## 22 0.007204603  0.2081258 3.8286414   1736
## 23 0.007036242 -0.2060394 0.6931472   1568
## 24 0.006871015  0.2031114 2.7080502   1271
## 25 0.006692523 -0.2005663 0.6931472    126
## 26 0.006692523 -0.2005663 0.6931472    522
## 27 0.006614538 -0.1998812 0.6931472    603
## 28 0.006614538 -0.1998812 0.6931472    651
## 29 0.006540486 -0.1981335 2.8903718    152
## 30 0.006232595  0.1933874 4.4188406   1081
## 31 0.005312783  0.1785477 1.7917595   1274
## 32 0.005199384  0.1766550 2.3025851    488
## 33 0.005199384  0.1766550 2.3025851    789
## 34 0.005043525 -0.1742889 1.0986123    181
## 35 0.005043525 -0.1742889 1.0986123   1470
## 36 0.004773879 -0.1693418 0.6931472   1058
## 37 0.004773879 -0.1693418 0.6931472   1127
## 38 0.004765361  0.1691855 3.7135721   1163
## 39 0.004716470 -0.1682306 3.0445224   1120
## 40 0.004538095  0.1650319 2.4849066    805
## 41 0.004519527  0.1646785 4.7621739    266
## 42 0.004413567  0.1627342 4.5108595   1039
## 43 0.004220689  0.1591861 4.1588831   1388
## 44 0.004081887  0.1565684 3.4011974   1371
## 45 0.003948164  0.1539102 4.4659081   1658
## 46 0.003830402 -0.1518083 1.3862944   1048
## 47 0.003830402 -0.1518083 1.3862944   1378
## 48 0.003830402 -0.1518083 1.3862944   1687
## 49 0.003809499 -0.1514566 1.3862944   1683
## 50 0.003809499 -0.1514566 1.3862944   1708
## 51 0.003789510 -0.1508454 1.0986123     75
## 52 0.003789510 -0.1508454 1.0986123    520
## 53 0.003708115  0.1491564 4.6821312   1050
## 54 0.003655270  0.1481283 4.0775374   1134
## 55 0.003483812 -0.1445717 2.9444390    847
## 56 0.003313937  0.1410357 4.0253517    331
## 57 0.003283773 -0.1403679 0.6931472    469
## 58 0.003283773 -0.1403679 0.6931472    550
## 59 0.003283773 -0.1403679 0.6931472   1278
## 60 0.003283773 -0.1403679 0.6931472   1279
## 61 0.003283773 -0.1403679 0.6931472   1282
## 62 0.003283773 -0.1403679 0.6931472   1286
## 63 0.003283773 -0.1403679 0.6931472   1314
## 64 0.003283773 -0.1403679 0.6931472   1472
## 65 0.003283773 -0.1403679 0.6931472   1478
## 66 0.003283773 -0.1403679 0.6931472   1615
## 67 0.003283773 -0.1403679 0.6931472   1616
## 68 0.003233340  0.1393260 3.2580965   1417
## 69 0.003168027  0.1378689 2.0794415   1275
## 70 0.003070108 -0.1359113 1.6094379    655
## 71 0.003070108 -0.1359113 1.6094379   1128
## 72 0.003015725 -0.1345047 2.9957323   1677
## 73 0.003011966 -0.1344215 3.2188758   1109
## 74 0.003011966 -0.1344215 3.2188758   1416
## 75 0.002584898  0.1245595 3.1354942   1010
## 76 0.002575213  0.1242902 4.5538769   1165
## 77 0.002525143 -0.1232231 1.7917595    982
## 78 0.002418897 -0.1204899 1.0986123     68
## 79 0.002418897 -0.1204899 1.0986123    383
## 80 0.002387714 -0.1199722 0.0000000    203
## 81 0.002387714 -0.1199722 0.0000000    343
## 82 0.002387714 -0.1199722 0.0000000    377
## 83 0.002387714 -0.1199722 0.0000000    381
## 84 0.002320347  0.1180061 3.2958369   1131
## 85 0.002297566 -0.1177467 0.0000000    320
## 86 0.002297566 -0.1177467 0.0000000    333
## 87 0.002297566 -0.1177467 0.0000000    399
## 88 0.002297566 -0.1177467 0.0000000   1334
## 89 0.002297566 -0.1177467 0.0000000   1474

There are no points that are extremely far away from the rest of the data in both Cook’s Distance and DFFITS plots. Therefore, the describes all observations assumption is probability met.

The independent assumption should meet because the data are collected from all stores, and additional predictor variables are not required for the purpose of the analysis. However, some of the more detail data about machines, like machine temperature might be helpful for more precise prediction.

Although the selection method and anova test do not suggest strong interaction between variables. I decided to try including an interaction plot between Broken and Rate since the results from our EDA suggested there might be a significant interaction between those two variables, which can be beneficial for future reference and indicator.

interaction.plot(x.factor = mock$Broken, 
                 trace.factor = mock$Rate, 
                 response = mock$Cases,
                 col = c("#1b9e77", "#d95f02"),
                 lwd = 2,
                 trace.label = "Rate",
                 ylab = "Cases",
                 xlab = "Broken")

The interaction plot shows Broken has strong interaction with Rate, specifically for Rate 1. This might be a great target for a more appropriate dataset to see how they affect the sales.


Final Linear Model

mock_lm_inter_trans <- lm(Log_Cases ~ machine1 + machine2 + machine3 + 
                RateL3 + BrokenBK
                , data = mock_select)
summary(mock_lm_inter_trans)
## 
## Call:
## lm(formula = Log_Cases ~ machine1 + machine2 + machine3 + RateL3 + 
##     BrokenBK, data = mock_select)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5681 -0.4844  0.1144  0.5764  2.2109 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.28561    0.11270  11.408  < 2e-16 ***
## machine1     1.16493    0.11203  10.399  < 2e-16 ***
## machine2     2.05135    0.11801  17.383  < 2e-16 ***
## machine3     2.39578    0.16560  14.467  < 2e-16 ***
## RateL3       0.23112    0.03971   5.820 7.00e-09 ***
## BrokenBK    -0.35038    0.07899  -4.435 9.76e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8134 on 1738 degrees of freedom
## Multiple R-squared:  0.2637, Adjusted R-squared:  0.2616 
## F-statistic: 124.5 on 5 and 1738 DF,  p-value: < 2.2e-16

Our final model seems to meet all of the assumptions of linear regression.

Our final fitted model is:

\(\log(\widehat{\text{Cases}}_i)\) \(= 1.28561 +\) \(1.16493\times\text{machine1}_i\) + \(2.05135\times\text{machine2}_i\) + \(2.39578\times\text{machine3}_i\) + \(0.23112\times\text{RateL3}_i -0.35038 \times\text{BrokenBK}_i\)


Model Assessment

Now that I have a model that describes the data well with all assumptions met, I would like to use the model to make inferences and predictions. I are interested in creating confidence intervals for the selected variables, as well as getting predictions for new cities. I are particularly interested in the correlation between predicted average cases and machine status.

summary(mock_lm_inter_trans)
## 
## Call:
## lm(formula = Log_Cases ~ machine1 + machine2 + machine3 + RateL3 + 
##     BrokenBK, data = mock_select)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5681 -0.4844  0.1144  0.5764  2.2109 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.28561    0.11270  11.408  < 2e-16 ***
## machine1     1.16493    0.11203  10.399  < 2e-16 ***
## machine2     2.05135    0.11801  17.383  < 2e-16 ***
## machine3     2.39578    0.16560  14.467  < 2e-16 ***
## RateL3       0.23112    0.03971   5.820 7.00e-09 ***
## BrokenBK    -0.35038    0.07899  -4.435 9.76e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8134 on 1738 degrees of freedom
## Multiple R-squared:  0.2637, Adjusted R-squared:  0.2616 
## F-statistic: 124.5 on 5 and 1738 DF,  p-value: < 2.2e-16
confint(mock_lm_inter_trans, level = 0.95)
##                  2.5 %     97.5 %
## (Intercept)  1.0645715  1.5066407
## machine1     0.9452088  1.3846474
## machine2     1.8198975  2.2827990
## machine3     2.0709728  2.7205847
## RateL3       0.1532300  0.3090118
## BrokenBK    -0.5053131 -0.1954449

All F-test and p-value of each variable all suggests the predictors has significant correlation with the sale cases. Generally, when there are more machines at the store for more than one years, the average sales case will increase, and a clean, modern store also increase the average case sales. On the other hand, a broken machine will decrease the average sales. The baseline of the model is zero machine at the store for more than one year and no broken machine at the store.

The confidence intervals for the variables are very informative. For example, I are 95% confident that, on average, when there are three machines at the store for more than one year with non-broken machine and the store is rated as filthy, the cases sale will increase from 207% to 272% comparing to the store with zero machines at the store for more than one year. And when there is one machines at the store for more than one year, on average, the cases sale will increase from 94% to 138% comparing to the store with zero machines at the store for more than one year, which suggests that increasing the number of machine at the store will have a positive influence on cases sales. On the other hand, holding all else constant, I are 95% confident that a broken machine at the store will lower the sale from 20% to 50% on average, comparing no broken machine at the store. This result indicates that maintain a functional machine is crucial for a higher case sale. Despite not significant, I are 95% confident that a clean, modern store will have a case sale increase from 15% to 30% on average, holding all else constant. A customer friendly store will have positive influence for the cases sale.

I are also interested in how well the model fits the data. To do this, I look at metrics such as the RMSE. These metrics are important to check and understand because I can know how well the model is fitting.

# MSE, RMSE
anova <- aov(mock_lm_inter_trans)  # get ANOVA components
mock_lm_inter_trans_anova <- summary(anova)[[1]]  # save data in a usable form
mse = mock_lm_inter_trans_anova["Residuals", "Sum Sq"] / 
  mock_lm_inter_trans_anova["Residuals", "Df"]
(sqrt(mse))
## [1] 0.8133812

The RMSE represents the average amount of spread in the residuals. The RMSE indicates that the average error in the model is around 0.81 sales. Since the mean value is 2.987, the average error perform by the model might be the potential concern when interpreting the data from the analysis.


Summary and Conclusions

Our business model will maintain a higher number of Frazil-owned machines. Therefore, understanding the correlation between machine utilization and cases sale is critical to the success of out business, and we want to see what the correlation is to predict future sale and effectively allocating the machines. The linear model shows there are significant relationships between cases-machine , cases-brokenMachine, and cases-storeRate. Generally, more machines at the store for more than one year will have a higher average cases sale. When a machine is broken at the store will lower the average cases sale, and a clean , modern store will have a higher average cases. In a word, while a store maintain the relationship with Frazil for more than one year or even have multiple machine, the average cases sale can increase from 100% to 270% respectably. Keeping stable relationship with the restaurants could promote the sales. In addition, for those store have clean environment and sustain the machine at the working condition, the average sale cases will be higher. The company should develop stable partnership with the restaurant that have high hygiene standard and cautious about the machines.